feat: expand eval dataset with edge and complex cases and refine prompts #458
Open
cocosheng-g wants to merge 33 commits into main from
Conversation
- Implement isolated `TestRig` for environment-safe, concurrent evaluations.
- Add gold-standard datasets for Issue Triage, Scheduled Triage, Assistant, and Issue Fixer.
- Implement Mock MCP Server for high-fidelity PR Review benchmarking.
- Add nightly evaluation workflow with multi-model strategy matrix.
- Automate aggregate reporting for GitHub Job Summaries.

Next Steps:
- Expand evaluation datasets with more edge cases.
- Fine-tune workflow prompts based on baseline quality analysis.

Refs: #219
- Added 30+ cases (edge, complex, real-life) across gemini-triage, gemini-issue-fixer, and gemini-review.
- Refined the triage prompt to handle spam, ambiguity, and vague reports more robustly.
- Added a validation step to the issue-fixer prompt to handle impossible or out-of-scope requests.
- Updated the mock MCP server to support new evaluation scenarios, including race conditions and architectural violations.
- Improved evaluation scripts for better tool-call detection across namespaces.
- Verified that all evaluations pass with the updated prompts.
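The mock MCP server's role can be sketched as a registry of deterministic tool handlers; this is a hypothetical illustration of the shape of the idea, not the repository's actual `mock-mcp-server.ts` or the real MCP SDK:

```typescript
// Hypothetical sketch: each eval scenario gets canned, deterministic
// responses for the tools the agent under test is expected to call.
type ToolHandler = (args: Record<string, unknown>) => string;

const tools = new Map<string, ToolHandler>([
  ["search_code", (args) => JSON.stringify({ matches: [], query: args["q"] })],
  ["get_file_contents", () => "// canned file contents for the scenario"],
]);

function callTool(name: string, args: Record<string, unknown>): string {
  const handler = tools.get(name);
  if (!handler) throw new Error(`Unknown tool: ${name}`);
  return handler(args);
}
```

Deterministic responses keep benchmark runs comparable across models, since no live GitHub state can drift between runs.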
Contributor

🤖 Hi @cocosheng-g, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
- Update triage guidelines for stricter handling of spam and ambiguity.
- Refine the fixer validation step to use explicit keywords for out-of-scope cases.
- Improve evaluation pass rates for edge cases.
Resolved conflicts:
- package.json: Use 'vitest' directly in the test script (from main).
- .github/workflows/evals-nightly.yml: Use the 'Install Gemini CLI' step and 'always()' condition (from main).
- evals/data/*.json: Keep expanded datasets (from HEAD).
- evals/pr-review.eval.ts: Keep updated test logic (from HEAD).
- evals/mock-mcp-server.ts: Manually merged new mock data and tool handlers.
- Run tests sequentially to reduce flakiness and avoid API rate limits.
- Enable the mock GitHub MCP server for the issue-fixer evaluation to match prompt instructions.
- Proactively create the 'chats' directory in the test rig to prevent 'ENOENT' errors during chat recording.
- Refine structural checks to handle out-of-scope/impossible requests and account for alternative git/issue tool usage.
- Update expected plan keywords in evaluation datasets.
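The 'chats' directory fix might look like this (a minimal sketch; the function name `ensureChatsDir` and the test-rig layout are assumptions, not the repository's actual code):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: ensure the chat-recording directory exists before
// the CLI under test writes into it, so recording never hits ENOENT.
function ensureChatsDir(testDir: string): string {
  const chatsDir = path.join(testDir, "chats");
  // recursive: true makes this a no-op if the directory already exists.
  fs.mkdirSync(chatsDir, { recursive: true });
  return chatsDir;
}
```

Creating the directory up front is cheaper and less flaky than retrying the write after catching ENOENT.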
…vals" This reverts commit 1bb9df0.
… standard runners
… improve pass rate
- Broaden the hasExploration check in issue-fixer.eval.ts to include MCP/extension tools.
- Add search_code and get_file_contents to mock-mcp-server.ts.
- Add a 2s delay before reading telemetry logs across all evals to prevent race conditions in CI.
- Fixes failures observed with gemini-3-pro-preview in CI.
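A minimal sketch of the 2-second settle delay (the helper names here are assumptions; the evals' actual code may structure this differently):

```typescript
// Hypothetical sketch: give the telemetry exporter time to flush its log
// file before the eval reads it, avoiding a read-before-write race in CI.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function readTelemetryLog(readFile: () => string): Promise<string> {
  await sleep(2000); // 2s settle delay before touching the log
  return readFile();
}
```

A fixed delay is a blunt instrument; polling for the log file with a timeout would be more robust, but a short sleep is often enough to stabilize CI.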
- Increase testTimeout to 15m to handle complex cross-file refactor tasks.
- Add 'search' to tool exploration keywords for broader detection.
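In a Vitest setup, the timeout bump might look like this (a sketch only; the repository's actual config file and its other options are not shown here):

```typescript
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // 15 minutes in ms: complex cross-file refactor evals can run long.
    testTimeout: 15 * 60 * 1000,
  },
});
```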
Signed-off-by: Coco Sheng <cocosheng@google.com>
This PR continues the work on issue #219 by expanding the evaluation datasets and refining the workflow prompts.
📊 Evaluation Results (Post-Tuning)
Changes:
- Expanded Evaluation Datasets: Added 30+ edge, complex, and real-life cases across triage, fixer, and pr-review.
- Prompt Refinements: Hardened the triage prompt against spam and ambiguity, and added a validation step to the issue-fixer prompt for impossible or out-of-scope requests.
- Verification: All evaluations have been verified to pass.